Tidyverse: Other Packages

Dr. Md. Zulquar Nain

tidyr - Data Tidying

  • A package in the Tidyverse for tidying data

  • Helps convert data between wide and long formats

  • Simplifies reshaping and cleaning datasets

  • Key Functions in tidyr

pivot_longer()

Converts wide data (the values of one variable spread across several columns) to long data (one column for each variable)

library(tidyr)

# Wide format: one column per year
data_wide <- tibble(
  id = 1:3,
  `2021` = c(100, 150, 200),
  `2022` = c(110, 160, 210)
)

# Gather the year columns into name/value pairs
data_long <- data_wide %>%
  pivot_longer(cols = `2021`:`2022`, names_to = "year", values_to = "value")

print(data_long)
# A tibble: 6 × 3
     id year  value
  <int> <chr> <dbl>
1     1 2021    100
2     1 2022    110
3     2 2021    150
4     2 2022    160
5     3 2021    200
6     3 2022    210
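The year column above comes back as character because it is built from column names. A minimal sketch, assuming tidyr 1.1.0 or later, of converting it to integer during the pivot with names_transform (the object name data_long_int is only illustrative):

```r
library(tidyr)

data_wide <- tibble(
  id = 1:3,
  `2021` = c(100, 150, 200),
  `2022` = c(110, 160, 210)
)

# Pivot every column except id and convert the new year column to integer
data_long_int <- data_wide %>%
  pivot_longer(
    cols = -id,
    names_to = "year",
    values_to = "value",
    names_transform = list(year = as.integer)
  )

data_long_int
```

Selecting with cols = -id avoids listing the year columns by name, which may help when more years are added later.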

pivot_wider()

Converting from Long to Wide Format

data_long <- tibble(
  id = c(1, 1, 2, 2, 3, 3),
  year = c("2021", "2022", "2021", "2022", "2021", "2022"),
  value = c(100, 110, 150, 160, 200, 210)
)

data_wide <- data_long %>%
  pivot_wider(names_from = year, values_from = value)

print(data_wide)
# A tibble: 3 × 3
     id `2021` `2022`
  <dbl>  <dbl>  <dbl>
1     1    100    110
2     2    150    160
3     3    200    210
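When some id/year combinations are absent from the long data, pivot_wider() leaves NA in the corresponding cells. A small sketch, using a hypothetical data_gaps tibble, of supplying a default with values_fill:

```r
library(tidyr)

# Long data where id 2 has no 2022 observation
data_gaps <- tibble(
  id = c(1, 1, 2),
  year = c("2021", "2022", "2021"),
  value = c(100, 110, 150)
)

# Without values_fill the missing cell would be NA; here it becomes 0
data_wide_filled <- data_gaps %>%
  pivot_wider(names_from = year, values_from = value, values_fill = 0)

data_wide_filled
```

Whether 0 is a sensible default depends on the data; for many measurements leaving the cell as NA is the safer choice.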

separate()

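Splits a single column into multiple columns at a separator. Since tidyr 1.3.0 it is superseded by separate_wider_delim() and related functions, but it still works.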

data <- tibble(name = c("John Doe", "Jane Smith", "Alice Johnson"))

data_separated <- data %>%
  separate(name, into = c("first_name", "last_name"), sep = " ")

print(data_separated)
# A tibble: 3 × 2
  first_name last_name
  <chr>      <chr>    
1 John       Doe      
2 Jane       Smith    
3 Alice      Johnson  
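Names with more than two words need extra care: by default separate() discards the surplus pieces with a warning. A minimal sketch, using a hypothetical people tibble, of keeping them with extra = "merge":

```r
library(tidyr)

people <- tibble(name = c("John Doe", "Mary Ann Lee"))

# extra = "merge" keeps any surplus pieces together in the last column
people_split <- people %>%
  separate(name, into = c("first_name", "last_name"), sep = " ", extra = "merge")

people_split
```

Here "Mary Ann Lee" is split once, so last_name holds "Ann Lee" rather than losing "Lee".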

unite()

Combines multiple columns into one

data <- tibble(first_name = c("John", "Jane", "Alice"),
               last_name = c("Doe", "Smith", "Johnson"))

data_united <- data %>%
  unite("full_name", first_name, last_name, sep = " ")

print(data_united)
# A tibble: 3 × 1
  full_name    
  <chr>        
1 John Doe     
2 Jane Smith   
3 Alice Johnson

drop_na()

Removes rows that contain a missing value (NA) in any column, or only in the columns you name

data <- tibble(x = c(1, NA, 3), y = c(4, 5, NA))

data_clean <- data %>%
  drop_na()

print(data_clean)
# A tibble: 1 × 2
      x     y
  <dbl> <dbl>
1     1     4
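Dropping every incomplete row can discard more data than intended. A short sketch of restricting the check to one column by naming it in drop_na() (the object name data_clean_x is illustrative):

```r
library(tidyr)

data <- tibble(x = c(1, NA, 3), y = c(4, 5, NA))

# Only rows with NA in x are dropped; the NA in y is kept
data_clean_x <- data %>%
  drop_na(x)

data_clean_x
```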

fill()

Fills missing values using the previous or next available value

data <- tibble(x = c(1, NA, 3, NA, 5))

data_filled <- data %>%
  fill(x, .direction = "down")

print(data_filled)
# A tibble: 5 × 1
      x
  <dbl>
1     1
2     1
3     3
4     3
5     5
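The .direction argument also accepts "up", "downup", and "updown". A minimal sketch of how two of them differ when the first and last values are missing:

```r
library(tidyr)

data <- tibble(x = c(NA, 2, NA, 4, NA))

# "up" fills each NA from the next non-missing value below it
filled_up <- data %>% fill(x, .direction = "up")

# "downup" fills downward first, then upward for any leading NA
filled_downup <- data %>% fill(x, .direction = "downup")

filled_up$x
filled_downup$x
```

With "up" the trailing NA has nothing below it and stays missing; "downup" leaves no NA in this example.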

tidyr vs dplyr

  • tidyr: Specializes in reshaping and tidying data (e.g., pivot_longer(), pivot_wider(), separate()).

  • dplyr: Specializes in data manipulation, such as subsetting, grouping, and summarizing (e.g., filter(), mutate(), summarize()).

readr - Data Import/Export

  • Part of the tidyverse, focused on reading and writing rectangular data (e.g., CSV, TSV).

  • Typically faster than base R equivalents such as read.csv(), with more consistent defaults

  • Returns tibbles, reports the guessed column types, and never converts strings to factors.

  • Key Functions:

read_csv()

library(readr)
# Read a comma-separated file
data_csv <- read_csv("hsbraw.csv")
head(data_csv)
# A tibble: 5 × 1
      x
  <dbl>
1     1
2    NA
3     3
4    NA
5     5
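read_csv() guesses column types, and the guess can be overridden with col_types. A self-contained sketch, assuming readr 2.0.0 or later, that reads literal text wrapped in I() instead of a file (the column names id and score are made up for illustration):

```r
library(readr)

# I() marks the string as literal data rather than a file path (readr >= 2.0.0)
csv_text <- I("id,score\n1,90.5\n2,85\n")

scores <- read_csv(
  csv_text,
  col_types = cols(
    id = col_integer(),
    score = col_double()
  )
)

scores
```

Stating the types explicitly also suppresses the column specification message that read_csv() otherwise prints.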

read_delim()

# Read a pipe-separated file
data_pipe <- read_delim("hsbraw.txt", delim = "|")
head(data_pipe)
# A tibble: 5 × 1
      x
  <dbl>
1     1
2    NA
3     3
4    NA
5     5

write_csv()

# Write a data frame to a CSV file
write_csv(data, "hsbraw.csv")

write_delim()

# Write data to a pipe-separated file
write_delim(data, "hsbraw.txt", delim = "|")
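Writing and reading can be checked together with a round trip. A minimal sketch that writes to a temporary file so no existing file is touched (the tibble original and its columns are hypothetical):

```r
library(readr)
library(tibble)

original <- tibble(x = c(1, NA, 3), label = c("a", "b", NA))

# tempfile() gives a throwaway path, so no real file is overwritten
path <- tempfile(fileext = ".csv")
write_csv(original, path)

# show_col_types = FALSE silences the column specification message
restored <- read_csv(path, show_col_types = FALSE)

identical(restored$x, original$x)
identical(restored$label, original$label)
```

By default write_csv() writes missing values as NA, and read_csv() reads that string back as a missing value.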

purrr - Functional Programming

  • Enhances R’s functional programming capabilities by providing a consistent, simple way to apply functions to lists, vectors, and other data structures.

  • Key functions:

map()

Applies a function to each element of a list and returns a list.

library(purrr)

# Create a list of numbers
numbers <- list(1, 2, 3, 4, 5)

# Apply a function to square each number using map
squared_numbers <- map(numbers, ~ .x^2)

# Print the result
squared_numbers
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25

map_dbl()

To apply a function and get back a vector of doubles instead of a list, use map_dbl(). It raises an error if any result is not a single number.

# Apply the same function, but return a vector of doubles
squared_numbers_vector <- map_dbl(numbers, ~ .x^2)

# Print the result
squared_numbers_vector
[1]  1  4  9 16 25
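map_dbl() also accepts an existing function by name and keeps the names of its input. A small sketch with a hypothetical named list called samples:

```r
library(purrr)

samples <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# One mean per list element; names carry over to the result
sample_means <- map_dbl(samples, mean)

sample_means
```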

map2()

  • Applies a function to pairs of elements taken from two inputs in parallel.

  • Useful when you need to operate on two lists or vectors of the same length (a length-one input is recycled).

# Create two lists
list1 <- list(1, 2, 3)
list2 <- list(10, 20, 30)

# Use map2 to add corresponding elements
sum_list <- map2(list1, list2, ~ .x + .y)

# Print the result
sum_list
[[1]]
[1] 11

[[2]]
[1] 22

[[3]]
[1] 33
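The typed variants exist for map2() as well, and pmap() extends the idea to more than two inputs. A minimal sketch with made-up prices, quantities, and discounts vectors:

```r
library(purrr)

prices <- c(10, 20, 30)
quantities <- c(1, 2, 3)
discounts <- c(0, 5, 10)

# map2_dbl pairs up elements and returns a double vector
totals <- map2_dbl(prices, quantities, ~ .x * .y)

# pmap_dbl takes a list of any number of parallel inputs
net_totals <- pmap_dbl(
  list(prices, quantities, discounts),
  function(price, quantity, discount) price * quantity - discount
)

totals
net_totals
```

For plain arithmetic like this, prices * quantities would do the same job; the map functions matter more when the applied function is not vectorised.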

map_chr()

Returns a character vector instead of a list; each result must be a single string.

# Create a list of numbers
numbers2 <- list(1, 2, 3, 4)

# Use map_chr to convert each number to a character string
char_numbers <- map_chr(numbers2, ~ as.character(.x))

# Print the result
char_numbers
[1] "1" "2" "3" "4"

purrr vs tidyr

  • purrr is specifically focused on working with lists and vectors using functional programming techniques.

  • tidyr is focused on reshaping data frames (long to wide format and vice versa), separating and combining columns.

tibble - Modern Data Frames

  • A modern version of a data frame.

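  • Stricter and more predictable than traditional data frames: no partial matching of column names and no silent type conversion.

  • Prints large data sets compactly, showing only the first rows and the columns that fit on screen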

Basic Tibble

library(tibble)
# Create a basic tibble
people_tibble <- tibble(
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 30, 22),
  Location = c("New York", "London", "Paris")
)

# Print the tibble
people_tibble
# A tibble: 3 × 3
  Name    Age Location
  <chr> <dbl> <chr>   
1 John     25 New York
2 Alice    30 London  
3 Bob      22 Paris   

Tibble with Mixed Data Types

# Create a tibble with mixed data types
mixed_tibble <- tibble(
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 30, 22),
  Is_Active = c(TRUE, FALSE, TRUE),
  Height = c(5.9, 5.6, 6.1)
)

# Print the tibble
mixed_tibble
# A tibble: 3 × 4
  Name    Age Is_Active Height
  <chr> <dbl> <lgl>      <dbl>
1 John     25 TRUE         5.9
2 Alice    30 FALSE        5.6
3 Bob      22 TRUE         6.1

Tibble vs. Data Frame

Data Frame

  • Prints every row (up to the max.print limit), which can be overwhelming with large datasets.
  • Converted character vectors to factors by default before R 4.0.0; since then, stringsAsFactors defaults to FALSE.

Tibble

  • Prints only the first 10 rows and the columns that fit on screen, making large datasets easier to inspect.

  • Never converts character columns into factors, which avoided a common surprise with data frames in older versions of R.
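A minimal sketch comparing the two, assuming R 4.0.0 or later (where data.frame() no longer converts strings to factors by default). It also shows a difference in subsetting: selecting one column with [ gives a vector from a data frame but a tibble from a tibble.

```r
library(tibble)

df <- data.frame(name = c("John", "Alice"), age = c(25, 30))
tb <- tibble(name = c("John", "Alice"), age = c(25, 30))

# In R >= 4.0.0 both keep strings as character
class(df$name)
class(tb$name)

# Single-column subsetting: a data frame drops to a vector, a tibble stays a tibble
class(df[, "name"])
class(tb[, "name"])
```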

stringr - String manipulation

  • Provides simple functions for common text operations.

  • Focuses on consistency and ease of use.

  • Key Functions:

str_detect()

# checks if a pattern exists in the string
library(stringr)
text <- "Hello, world!"
has_hello <- str_detect(text, "Hello")
has_hello
[1] TRUE
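str_detect() is vectorised and treats the pattern as a regular expression. A short sketch with a hypothetical fruits vector, using the result to subset:

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")

# One TRUE/FALSE per element; the pattern is a regular expression
has_an <- str_detect(fruits, "an")

# Use the logical vector to keep only the matching elements
fruits[has_an]

# "^" anchors the match to the start of the string
starts_with_c <- str_detect(fruits, "^c")
```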

str_replace()

# replaces the first occurrence of a pattern with a new string
text <- "I love R programming."
new_text <- str_replace(text, "R", "Python")
new_text
[1] "I love Python programming."

str_replace_all()

# replaces all occurrences of a pattern
text <- "The cat is on the mat. The cat is cute."
new_text_all <- str_replace_all(text, "cat", "dog")
new_text_all
[1] "The dog is on the mat. The dog is cute."
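Because patterns are regular expressions by default, characters such as "." have a special meaning. A minimal sketch of the difference that wrapping the pattern in fixed() makes (the variable names are illustrative):

```r
library(stringr)

version <- "a.b.c"

# As a regular expression, "." matches any character
regex_result <- str_replace_all(version, ".", "-")

# fixed() treats the pattern as a literal dot
literal_result <- str_replace_all(version, fixed("."), "-")

regex_result
literal_result
```

The first call replaces all five characters, while the second replaces only the two dots.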

str_sub()

# extracts a substring by start and end position
text <- "Hello, world!"
sub_text <- str_sub(text, 1, 5)  # named sub_text to avoid masking base R's substring()
sub_text
[1] "Hello"

str_to_lower()

# converts all characters in the string to lowercase
text <- "HELLO, WORLD!"
lowercase_text <- str_to_lower(text)
lowercase_text
[1] "hello, world!"

str_split()

# splits a string by a delimiter; returns a list with one character vector per input string
text <- "apple,banana,cherry"
split_text <- str_split(text, ",")
split_text
[[1]]
[1] "apple"  "banana" "cherry"

str_length()

# returns the number of characters in the string
text <- "Hello"
string_length <- str_length(text)
string_length
[1] 5

lubridate - Dates and Times

  • Makes working with dates and date-times easier

  • Provides functions to manipulate, parse, and format date-time data.

  • Simplifies operations like extracting parts of date-time, arithmetic operations, and handling time zones.

Parsing Dates

library(lubridate)
# Parse a date in year-month-day format
date1 <- ymd("2025-03-03")
print(date1)
[1] "2025-03-03"

Parsing Dates with Time

# Parse a date-time; with no tz argument the result is in UTC
datetime1 <- ymd_hms("2025-03-03 12:30:45")
print(datetime1)
[1] "2025-03-03 12:30:45 UTC"
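A time zone can be supplied through the tz argument, and with_tz() converts the same instant to another zone. A minimal sketch, where the choice of "Asia/Kolkata" is only an example:

```r
library(lubridate)

# tz sets the time zone the clock time is interpreted in
datetime_ist <- ymd_hms("2025-03-03 12:30:45", tz = "Asia/Kolkata")

# with_tz() shows the same instant on another zone's clock
datetime_utc <- with_tz(datetime_ist, "UTC")

datetime_ist
datetime_utc
```

India Standard Time is 5 hours 30 minutes ahead of UTC, so the UTC clock reads 07:00:45 for this instant.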

Extracting Date-Time Components

# Extract components from a date-time object
year(datetime1)
[1] 2025
month(datetime1)
[1] 3
day(datetime1)
[1] 3
hour(datetime1)
[1] 12
minute(datetime1)
[1] 30
second(datetime1)
[1] 45

Handling Time Intervals

# Define an interval
start <- ymd_hms("2025-03-01 00:00:00")
end <- ymd_hms("2025-03-05 23:59:59")
interval1 <- interval(start, end)
print(interval1)
[1] 2025-03-01 UTC--2025-03-05 23:59:59 UTC
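Once an interval exists, it can be measured and queried. A short sketch, with hypothetical names such as trip, using time_length() and the %within% operator:

```r
library(lubridate)

start_time <- ymd_hms("2025-03-01 00:00:00")
end_time <- ymd_hms("2025-03-05 00:00:00")
trip <- interval(start_time, end_time)

# Length of the interval in days
trip_days <- time_length(trip, unit = "day")

# Test whether a date-time falls inside the interval
inside <- ymd_hms("2025-03-03 12:00:00") %within% trip
outside <- ymd_hms("2025-03-06 00:00:00") %within% trip

trip_days
inside
outside
```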

forcats - Working with Factors

  • Makes working with categorical variables easier.

  • Provides tools to manipulate factors and handle tasks like reordering, renaming, and combining levels.

  • Key functions:

Creating Factors

library(forcats)
# Create a factor from a vector
categories <- c("Low", "High", "Medium", "Low", "High")
factor_categories <- factor(categories)
print(factor_categories)
[1] Low    High   Medium Low    High  
Levels: High Low Medium

Reordering factor levels

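# Reorder levels by the median of a second vector within each level
# (here Medium = 1, High = 3.5, Low = 3.5; ties keep their existing order)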
factor_ordered <- fct_reorder(factor_categories, c(3, 2, 1, 4, 5))
print(factor_ordered)
[1] Low    High   Medium Low    High  
Levels: Medium High Low
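fct_reorder() is easier to read when the second vector is a real measurement. A minimal sketch with made-up scores, alongside fct_relevel() for setting an order by hand:

```r
library(forcats)

sizes <- factor(c("Low", "High", "Medium", "Low", "High"))
scores <- c(1, 9, 5, 2, 8)

# Order levels by the median score within each level (the default .fun)
by_score <- fct_reorder(sizes, scores)
levels(by_score)

# Set a known order by hand
by_hand <- fct_relevel(sizes, "Low", "Medium", "High")
levels(by_hand)
```

The medians are 1.5 for Low, 5 for Medium, and 8.5 for High, so both calls end up with the same level order in this example.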

Changing Factor Levels

# Rename factor levels: new name on the left, old name on the right
new_factor <- fct_recode(factor_categories, "Very Low" = "Low", "Very High" = "High")
print(new_factor)
[1] Very Low  Very High Medium    Very Low  Very High
Levels: Very High Very Low Medium

Combining Factor Levels

# Combine similar factor levels
collapsed_factor <- fct_collapse(factor_categories, 
                                 "Low/Medium" = c("Low", "Medium"),
                                 "High" = "High")
print(collapsed_factor)
[1] Low/Medium High       Low/Medium Low/Medium High      
Levels: High Low/Medium

Dropping Unused Factor Levels

# Drop unused factor levels (every level is in use here, so nothing changes)
dropped_factor <- fct_drop(factor_categories)
print(dropped_factor)
[1] Low    High   Medium Low    High  
Levels: High Low Medium
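The effect of fct_drop() only shows when a level is declared but never used. A minimal sketch with a hypothetical ratings factor:

```r
library(forcats)

# "Medium" is declared as a level but never appears in the data
ratings <- factor(c("Low", "High", "Low"), levels = c("Low", "Medium", "High"))

levels(ratings)
ratings_dropped <- fct_drop(ratings)
levels(ratings_dropped)
```

This situation commonly arises after filtering a data set, when some categories no longer have any rows.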

Visualizing Factor Data

library(ggplot2)
# Plot counts by class with the factor levels in reverse order
ggplot(mpg, aes(x = fct_rev(class))) + 
  geom_bar() + 
  labs(title = "Count of Cars by Class")
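For bar charts, ordering by frequency is often more informative than reversing the alphabetical order. A sketch using fct_infreq() on the same mpg data (the object names class_by_count and plot_by_count are illustrative):

```r
library(ggplot2)
library(forcats)

# fct_infreq() orders levels from most to least frequent
class_by_count <- fct_infreq(mpg$class)
levels(class_by_count)

# Bars now appear from the most common class to the least common
plot_by_count <- ggplot(mpg, aes(x = fct_infreq(class))) +
  geom_bar() +
  labs(title = "Count of Cars by Class", x = "class")

plot_by_count
```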

forcats vs dplyr

  • forcats: Designed specifically for factors; its functions reorder, rename, combine, and drop factor levels.

  • dplyr: Works on whole data frames; factor columns are usually modified by calling forcats functions (or base R's factor()) inside mutate().

THANKS